Multi-Language Support for Interfacing with Distributed Data

ABSTRACT

A data analysis system stores an in-memory representation of a distributed data structure across a plurality of processors of a parallel or distributed system. Client applications interact with the in-memory distributed data structure to process queries using the in-memory distributed data structure and to modify the in-memory distributed data structure. The data analysis system creates a uniform resource identifier (URI) to identify each in-memory distributed data structure. The URI can be communicated from one client application to another application using communication mechanisms outside the data analysis system, for example, by email, thereby allowing other client devices to interact with a particular in-memory distributed data structure. The in-memory distributed data structure can be a machine learning model that is trained by one client device and executed by another client device. A client application can interact with the in-memory distributed data structure using different programming languages.

CROSS REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 14/814,457, which claims the benefit of U.S. Provisional Application No. 62/086,158 filed on Dec. 1, 2014, both of which are hereby incorporated by reference in their entirety.

BACKGROUND

The disclosure relates to efficient processing of large data sets using parallel and distributed systems. More specifically, the disclosure concerns various aspects of processing of distributed data structures including collaborative processing of distributed data structures using shared documents.

Enterprises produce large amounts of data based on their daily activities. This data is stored in a distributed fashion among a large number of computer systems. For example, a large amount of information is stored as logs of various systems of the enterprise. Processing such large amounts of data to gain meaningful insights into the information describing the enterprise requires a large amount of resources. Furthermore, conventional techniques available for processing such large amounts of data typically require users to perform cumbersome programming.

Furthermore, users have to deal with complex systems that perform parallel/distributed programming to be able to process such large amounts of data. Software developers and programmers (also referred to as data engineers) who are experts at programming and using such complex systems typically do not have the knowledge of a business expert or a data scientist to be able to identify the requirements for the analysis. Nor are the software developers able to analyze the results on their own.

As a result, there is a gap between the process of identifying requirements and analyzing results and the process of programming the parallel/distributed systems to achieve the results. This gap results in time-consuming communications between the business experts/data scientists and the data engineers. Data scientists, business experts, as well as data engineers act as resources of an enterprise. As a result, the above gap adds significant costs to the process of data analysis. Furthermore, this gap leads to possibilities of errors in the analysis since a data engineer can misinterpret certain requirements and may generate incorrect results. The business experts or the data scientists do not have the time or the expertise to verify the accuracy of the software developed by the developers.

Some tools and systems are available to assist data scientists and business experts with the above process of providing requirements and analyzing results of big data analysis. The tools and systems used by data scientists are typically difficult for business experts to use, and tools and systems used by business experts are difficult for data scientists to use. This creates another gap between the analysis performed by data scientists and the analysis performed by business experts. Therefore, conventional techniques for providing insights into big data stored in distributed systems of an enterprise fail to provide a suitable interface for users to analyze the information available in the enterprise.

SUMMARY

Embodiments provide multi-language support for data processing. A system stores an in-memory distributed data frame structure (DDF) across a plurality of compute nodes. Each compute node stores a portion of the in-memory distributed data structure (DDF segment). The data of the DDF conforms to a primary language. The system further stores a document comprising text and code blocks. The code blocks comprise a first code block for providing instructions using the primary language and a second code block for providing instructions using a secondary language.

The system receives a request to process instructions specified in the first code block using the primary language. Each compute node processes the instructions to process the DDF segment mapped to the compute node. The system further receives a request to process instructions specified in the second code block using the secondary language. Each compute node transforms the data of the DDF segment mapped to the compute node to conform to the format of the secondary language. Each compute node executes the instructions of the secondary language to generate a result DDF segment. The system transforms data of the result DDF segment to a format conforming to the primary language. Each compute node processes further instructions specified using the primary language to process the transformed result DDF segment mapped to the compute node.
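The per-node round trip described above can be pictured with a minimal single-process sketch. Everything in it is an assumption made for illustration: rows held as dictionaries stand in for the primary-language format, column vectors stand in for the secondary-language format, and the names to_secondary, to_primary, run_secondary_block, and double_x are hypothetical rather than part of the disclosed system.

```python
from typing import Callable, Dict, List

Row = Dict[str, float]            # assumed primary-language format: one dict per row
Columns = Dict[str, List[float]]  # assumed secondary-language format: column vectors


def to_secondary(segment: List[Row]) -> Columns:
    """Transform one DDF segment from row form to column form."""
    columns: Columns = {}
    for row in segment:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def to_primary(columns: Columns) -> List[Row]:
    """Transform a result segment back from column form to row form."""
    names = list(columns)
    length = len(columns[names[0]]) if names else 0
    return [{name: columns[name][i] for name in names} for i in range(length)]


def run_secondary_block(
    segments: List[List[Row]], block: Callable[[Columns], Columns]
) -> List[List[Row]]:
    """Apply a secondary-language code block to each segment independently."""
    return [to_primary(block(to_secondary(segment))) for segment in segments]


def double_x(columns: Columns) -> Columns:
    """Hypothetical secondary-language instructions: add a derived column."""
    return {**columns, "x2": [value * 2 for value in columns["x"]]}


# Two segments, as if mapped to two compute nodes.
segments = [[{"x": 1.0}, {"x": 2.0}], [{"x": 3.0}]]
result = run_secondary_block(segments, double_x)
```

Because each segment is converted, processed, and converted back on its own, the result segments stay aligned with the nodes that hold the inputs, so later primary-language instructions can continue on the same layout.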

BRIEF DESCRIPTION OF DRAWINGS

The disclosed embodiments have other advantages and features which will be more readily apparent from the detailed description, the appended claims, and the accompanying figures (or drawings). A brief introduction of the figures is below.

FIG. 1 shows the overall system environment for performing analysis of big data, in accordance with an embodiment of the invention.

FIG. 2 shows the system architecture of a big data analysis system, in accordance with an embodiment.

FIG. 3 shows the system architecture of the distributed data framework, in accordance with an embodiment.

FIG. 4 illustrates a process for collaborative processing based on a DDF between users, according to an embodiment of the invention.

FIG. 5 illustrates a process for converting immutable data sets received from the in-memory cluster computing engine to mutable data sets, according to an embodiment of the invention.

FIG. 6 shows the process illustrating use of a global editing mode and a local editing mode during collaboration, according to an embodiment of the invention.

FIG. 7 shows collaborative editing of documents including text, code, and charts based on distributed data structures, according to an embodiment.

FIG. 8 shows an example of a shared document with code and results based on code, according to an embodiment.

FIG. 9 shows an example of a shared document with code and results updated based on execution of the code, according to an embodiment.

FIG. 10 shows an example of a dashboard generated from a document based on distributed data structures, according to an embodiment.

FIG. 11 shows the architecture of the in-memory cluster computing engine illustrating how a distributed data structure (DDF) is allocated to various compute nodes, according to an embodiment.

FIG. 12 shows the architecture of the in-memory cluster computing engine illustrating how multiple runtimes are used to process instructions provided in multiple languages, according to an embodiment.

FIG. 13 shows an interaction diagram illustrating the processing of DDFs based on instructions received in multiple languages, according to an embodiment.

FIG. 14 is a high-level block diagram illustrating an example of a computer for use as a system for performing formal verification with low power considerations, in accordance with an embodiment.

The features and advantages described in the specification are not all inclusive and in particular, many additional features and advantages will be apparent to one of ordinary skill in the art in view of the drawings, specification, and claims. Moreover, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the disclosed subject matter.

DETAILED DESCRIPTION

A big data analysis system provides an abstraction of database tables based on distributed in-memory data structures obtained from big data sources, for example, files of a distributed file system. A user can retrieve data from a large, distributed, complex file system and treat the data as database tables. As a result, the big data analysis system allows users to apply to large datasets familiar data abstractions, such as filtering and joining data, that are commonly supported by systems that handle small data, for example, single processor database management systems. The big data analysis system supports various features including schema, filtering, projection, transformation, data mining, machine learning, and so on.

Embodiments support various operations commonly used by data scientists. These include various types of statistical computation, sampling, machine learning, and so on. However, these operations are supported on large data sets processed by a distributed architecture. Unlike conventional systems that allow processing of large data using distributed systems, embodiments create a long term session for a user and track distributed data structures for users so as to allow users to modify the distributed data structures. The ability to create long term sessions allows embodiments to provide functionality similar to existing data analysis systems that are used for small data processing, for example, the R programming language and interactive system.

Furthermore, embodiments support high-level data analytics functionality, thereby allowing users to focus on the data analysis rather than low level implementation details of how to manage large data sets. This is distinct from conventional systems, for example, systems that support the map-reduce paradigm and require users to express high-level analytics functions as map and reduce functions. The map-reduce paradigm requires users to be aware of the distributed nature of data and requires users to use the map and reduce operations for expressing the data analysis operations.

Embodiments further allow integration of large data sets with various machine learning techniques, for example, with externally available machine learning libraries. Furthermore, the ability to store distributed data structures in memory and identify the distributed data structures using a URI allows embodiments to support clients using various languages, for example, Java, Scala, R and Python, and also natural language.

Embodiments support collaboration between multiple users working on the same distributed data set. A user can refer to a distributed data structure using a URI (uniform resource identifier). The URI can be passed between users, for example, by email. Accordingly, a new user can get access to a distributed data structure that is stored in memory. Embodiments further allow a user to train a machine learning model, create a name for the machine learning model and transfer the name to another user so as to allow the other user to execute the machine learning model. For example, a data scientist can create a distributed data structure or a machine learning model and provide it to an executive of an enterprise to present the data or model to an audience. The executive can perform further processing using the data or the model as part of a presentation by connecting to a system based on these embodiments.

Reference will now be made in detail to several embodiments, examples of which are illustrated in the accompanying figures. It is noted that wherever practicable similar or like reference numbers may be used in the figures and may indicate similar or like functionality. The figures depict embodiments of the disclosed system (or method) for purposes of illustration only. One skilled in the art will readily recognize from the following description that alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles described herein.

FIG. 1 shows the overall system environment for performing analysis of big data, in accordance with an embodiment of the invention. The overall system environment includes an enterprise 110, a big data analysis system 100, a network 150 and client devices 130. Other embodiments can use more, fewer, or different systems than those illustrated in FIG. 1. Functions of various modules and systems described herein can be implemented by other modules and/or systems than those described herein.

FIG. 1 and the other figures use like reference numerals to identify like elements. A letter after a reference numeral, such as “120 a,” indicates that the text refers specifically to the element having that particular reference numeral. A reference numeral in the text without a following letter, such as “120,” refers to any or all of the elements in the figures bearing that reference numeral (e.g. “120” in the text refers to reference numerals “120 a” and/or “120 b” in the figures).

The enterprise 110 is any business or organization that uses computer systems for processing its data. Enterprises 110 are typically associated with a business activity, for example, sale of certain products or services, but can be any organization or group of organizations that generates a significant amount of data. The enterprise 110 includes several computer systems 120 for processing information of the enterprise. For example, a business may use computer systems for performing various tasks related to the products or services offered by the business. These tasks include sales transactions, inventory management, employee activities, workflow coordination, information technology management, and so on.

Performing these tasks may generate a large amount of data for the enterprise. For example, an enterprise may perform thousands of transactions daily. Different types of information are generated for each transaction, including information describing the products/services involved in the transaction, errors/warnings generated by the system during transactions, and information describing involvement of personnel from the enterprise, for example, sales representatives, technical support, and so on. This information accumulates over days, weeks, months, and years, resulting in a large amount of data.

For example, airlines process data of hundreds of thousands of passengers traveling every day and large numbers of flights carrying passengers every day. The information describing the flights and passengers of each flight over a few years can be several terabytes of data. Enterprises that process petabytes of data are not uncommon. Similarly, search engines may store information describing millions of searches performed by users on a daily basis that can generate terabytes of data in a short time interval. As another example, social networking systems can have hundreds of millions of users. These users interact daily with the social networking system, generating petabytes of data.

The big data analysis system 100 allows analysis of the large amount of data generated by the enterprise. The big data analysis system 100 may include a large number of processors for analyzing the data of the enterprise 110. In some embodiments, the big data analysis system 100 is part of the enterprise 110 and utilizes computer systems 120 of the enterprise 110. Data from the computer systems 120 of enterprise 110 that generate the data may be imported 155 into the computer systems that perform the big data analysis.

The client devices 130 are used by users of the big data analysis system 100 to perform the analysis and study of data obtained from the enterprise 110. The users of the client devices 130 include data analysts, data engineers, and business experts. In an embodiment, the client device 130 executes a client application that allows users to interact with the big data analysis system 100. For example, the client application executing on the client device 130 may be an internet browser that interacts with web servers executing on computer systems of the big data analysis system 100.

Systems and applications shown in FIG. 1 can be executed using computing devices. A computing device can be a conventional computer system executing, for example, a Microsoft™ Windows™-compatible operating system (OS), Apple™ OS X, and/or a Linux distribution. A computing device can also be a client device having computer functionality, such as a personal digital assistant (PDA), mobile telephone, video game system, etc.

The interactions between the client devices 130 and the big data analysis system 100 are typically performed via a network 150, for example, via the internet. The interactions between the big data analysis system 100 and the computer systems 120 of the enterprise 110 are also typically performed via a network 150. In one embodiment, the network uses standard communications technologies and/or protocols. In another embodiment, the various entities interacting with each other, for example, the big data analysis system 100, the client devices 130, and the computer systems 120 can use custom and/or dedicated data communications technologies instead of, or in addition to, the ones described above. Depending upon the embodiment, the network can also include links to other networks such as the Internet.

System Architecture

FIG. 2 shows the system architecture of a big data analysis system, in accordance with an embodiment. A big data analysis system 100 comprises a distributed file system 210, an in-memory cluster computing engine 220, a distributed data framework 200, an analytics framework 230, a web server 240, a custom application server 260, and a programming language interface 270. The big data analysis system 100 may include additional or fewer modules than those shown in FIG. 2. Furthermore, specific functionality may be implemented by modules other than those described herein. The big data analysis system 100 is also referred to herein as a data analysis system or a system.

The distributed file system 210 includes multiple data stores 250. These data stores 250 may execute on different computers. The distributed file system 210 may store large data files that may store gigabytes or terabytes of data. The data files may be distributed across multiple computer systems. In an embodiment, the distributed file system 210 replicates the data for high availability. Typically, the distributed file system 210 processes immutable files to which writes are not performed. An example of a distributed file system is HADOOP DISTRIBUTED FILE SYSTEM (HDFS).

The in-memory cluster computing engine 220 loads data from the distributed file system 210 into a cluster of compute nodes 280. Each compute node includes one or more processors and memory for storing data. The in-memory cluster computing engine 220 stores data in-memory for fast access and fast processing. For example, if the distributed data framework 200 receives repeated queries for processing the same data structure stored in the in-memory cluster computing engine 220, the distributed data framework 200 can process the queries efficiently by reusing the data structure stored in memory without having to load the data from the file system. An example of an in-memory cluster computing engine is the APACHE SPARK system.
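The benefit of keeping a data structure in memory across repeated queries can be shown with a small sketch. It is an illustration under stated assumptions, not the engine's implementation: the class InMemoryEngine, its loader callback, and the loads counter are hypothetical names introduced here, and a plain Python list stands in for a distributed data structure.

```python
from typing import Callable, Dict, List


class InMemoryEngine:
    """Toy stand-in for an engine that keeps loaded data structures in memory."""

    def __init__(self, load_from_file_system: Callable[[str], List[dict]]) -> None:
        self._load = load_from_file_system
        self._cache: Dict[str, List[dict]] = {}
        self.loads = 0  # counts how often the file system was actually read

    def get(self, path: str) -> List[dict]:
        # Read from the file system only on the first request for a path.
        if path not in self._cache:
            self._cache[path] = self._load(path)
            self.loads += 1
        return self._cache[path]

    def query(self, path: str, predicate: Callable[[dict], bool]) -> List[dict]:
        # Every query reuses the structure already held in memory.
        return [row for row in self.get(path) if predicate(row)]


def fake_loader(path: str) -> List[dict]:
    """Hypothetical loader standing in for a read from the distributed file system."""
    return [{"id": 1, "ok": True}, {"id": 2, "ok": False}]


engine = InMemoryEngine(fake_loader)
first = engine.query("logs/2014", lambda row: row["ok"])
second = engine.query("logs/2014", lambda row: not row["ok"])
```

After both queries the loader has run once, which is the behavior the paragraph above attributes to in-memory reuse.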

The distributed data framework 200 provides an abstraction that allows the modules interacting with the distributed data framework 200 to treat the underlying data provided by the distributed file system 210 or the in-memory cluster computing engine 220 interface as structured data comprising tables. The distributed data framework 200 supports an application programming interface (API) that allows a caller to treat the underlying data as tables. For example, a software module can interact with the distributed data framework 200 by invoking APIs supported by the distributed data framework 200.

Furthermore, the interface provided by the distributed data framework 200 is independent of the underlying system. In other words, the distributed data framework 200 may be provided using different implementations of in-memory cluster computing engines 220 (or different distributed file systems 210) that are provided by different vendors and support different types of interfaces. However, the interface provided by the distributed data framework 200 is the same for different underlying systems.

The table based structure allows users familiar with database technology to process data stored in the in-memory cluster computing engine 220. The table based distributed data structure provided by the distributed data framework is referred to as a distributed data-frame (DDF). The data stored in the in-memory cluster computing engine 220 may be obtained from data files stored in the distributed file system 210, for example, log files generated by computer systems of an enterprise.

The distributed data framework 200 processes large amounts of data using the in-memory cluster computing engine 220, for example, materialization and transformation of large distributed data structures. The distributed data framework 200 performs computations that generate smaller size data, for example, aggregation or summarization results, and provides these results to a caller of the distributed data framework 200. The caller of the distributed data framework 200 is typically a machine that is not capable of handling large distributed data structures. For example, a client device may receive the smaller size data generated by the distributed data framework 200 and perform visualization of the data or presentation of data via different types of user interfaces. Accordingly, the distributed data framework 200 hides the complexity of large distributed data structures and provides an interface that is based on manipulation of small data structures, for example, database tables.
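As a hedged illustration of how a large distributed structure can be reduced to a small result for the caller, the sketch below computes a mean from per-segment partial results. The function names and the use of nested Python lists as segments are assumptions made for this example only.

```python
from typing import List, Tuple


def partial_sum_count(segment: List[float]) -> Tuple[float, int]:
    """Work done next to the data: reduce one segment to two numbers."""
    return sum(segment), len(segment)


def distributed_mean(segments: List[List[float]]) -> float:
    """Combine the small per-segment results into one value for the caller."""
    partials = [partial_sum_count(segment) for segment in segments]
    total = sum(partial[0] for partial in partials)
    count = sum(partial[1] for partial in partials)
    if count == 0:
        raise ValueError("cannot compute the mean of an empty data structure")
    return total / count


# Three segments, as if held by three compute nodes.
segments = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
mean_value = distributed_mean(segments)
```

Only one pair of numbers per segment crosses to the caller, so the client never has to hold the full data structure.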

In an embodiment, the distributed data framework 200 supports SQL (structured query language) queries, data table filtering, projections, group by, and join operations based on distributed data-frames. The distributed data framework 200 provides transparent handling of missing data, APIs for transformation of data, and APIs providing machine-learning features based on distributed data-frames.

The analytics framework 230 supports higher level operations based on the table abstraction provided by the distributed data framework 200. For example, the analytics framework 230 supports collaboration using the distributed data structures represented within the in-memory cluster computing engine 220. The analytics framework 230 supports naming of distributed data structures to facilitate collaboration between users of the big data analysis system 100. In an embodiment, the analytics framework 230 maintains a table mapping user specified names to locations of data structures.

The analytics framework 230 allows computation of statistics describing a DDF, for example, mean, standard deviation, variance, count, minimum value, maximum value, and so on. The analytics framework 230 also determines multivariate statistics for a DDF including correlation and contingency tables. Furthermore, the analytics framework 230 allows grouping of DDF data and merging of two or more DDFs. Several examples of the types of computations supported by the analytics framework 230 are disclosed in the Appendix.

The big data analysis system 100 allows different types of interfaces to interact with the underlying data. These include programming language based interfaces as well as graphical user interface based interfaces. The web server 240 allows users to interact with the big data analysis system 100 via browser applications or via web services. The custom application server 260 allows custom applications to interact with the big data analysis system 100.

The web server 240 receives requests from web browser clients and processes the requests. The web browser requests are typically requests sent using a web browser protocol, for example, the hyper-text transfer protocol (HTTP). The results returned to the requester are typically in the form of markup language documents, for example, documents specified in hyper-text markup language (HTML).

The custom application server 260 receives and processes requests from custom applications that are designed for interacting with the big data analysis system 100. For example, a customized user interface receives requests for the big data analysis system 100 specified using data analysis languages, for example, the R language used for statistical computing. The customized user interface may use a proprietary protocol for interacting with the big data analysis system 100.

The programming language interface 270 allows programs written in specific programming languages supported by the big data analysis system 100 to interact with the big data analysis system 100. For example, programmers can interact with the data analysis system 100 using PYTHON or JAVA language constructs.

The distributed data framework 200 supports various types of analytics operations based on the data structures exposed by the distributed data framework 200. FIG. 3 shows various modules within a distributed data framework module, in accordance with an embodiment of the invention. As shown in FIG. 3, the distributed data framework 200 includes a distributed data-frame manager 310 and handlers including ETL handler 320, statistics handler 330, and machine learning handler 340. Other embodiments may include more or fewer modules than those shown in FIG. 3.

The distributed data-frame manager 310 supports loading data from big data sources of the distributed file system 210 into DDFs. The distributed data-frame manager 310 also manages a pool of DDFs. The various handlers provide a pluggable architecture, making it easy to include new functionality into or replace existing functionality from the distributed data framework 200. The ETL handler 320 supports ETL (extract, transform, and load) operations, the statistics handler 330 supports various statistical computations applied to DDFs, and the machine learning handler 340 supports machine learning operations based on DDFs.
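One way to picture a pluggable handler architecture is a registry keyed by handler kind, as in the sketch below. This is an assumption about how such pluggability could be arranged, not a description of the disclosed modules; the class name, the register_handler and run methods, and the lambda handlers are hypothetical.

```python
from typing import Callable, Dict, List

Handler = Callable[..., object]


class DistributedDataFramework:
    """Toy framework that dispatches operations to registered handlers."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}

    def register_handler(self, kind: str, handler: Handler) -> None:
        """Add a handler, or replace the one already registered for `kind`."""
        self._handlers[kind] = handler

    def run(self, kind: str, ddf: List[float], *args: object) -> object:
        if kind not in self._handlers:
            raise KeyError(f"no handler registered for {kind!r}")
        return self._handlers[kind](ddf, *args)


framework = DistributedDataFramework()

# Plug in a statistics handler, use it, then swap in a different one.
framework.register_handler("statistics", lambda ddf: sum(ddf) / len(ddf))
mean_result = framework.run("statistics", [2.0, 4.0, 6.0])

framework.register_handler("statistics", lambda ddf: max(ddf))
max_result = framework.run("statistics", [2.0, 4.0, 6.0])
```

Callers only name the kind of operation, so a handler can be added or replaced without changing the code that invokes it.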

In an embodiment, the distributed data framework 200 provides interfaces in different programming languages including Java, Scala, R and Python so that users can easily interact with the in-memory cluster computing engine 220. In a client/server setting, a client can connect to a distributed data framework 200 via a web browser or a custom application based interface and issue commands for execution on the in-memory cluster computing engine 220.

The distributed data framework 200 allows users to load a DDF in memory and perform operations on the data stored in memory. These include filtering, aggregating, joining a data set with another, and so on. Since a client device 130 has limited resources in terms of computing power or memory, a client device 130 is unable to load an entire DDF from the in-memory cluster computing engine 220. Therefore, the distributed data framework 200 supports APIs that allow a subset of data to be retrieved from a DDF by a requestor.

In an embodiment, the distributed data framework 200 supports an API “fetchRows(df, N)” that allows the caller to retrieve the first N rows of a DDF df. If the distributed data framework 200 receives a request “fetchRows(df, N)”, the distributed data framework 200 identifies the first N rows of the DDF df and returns the identified rows to the caller.

The distributed data framework 200 supports an API “sample(df, N)” that allows the caller to retrieve a sample of N rows of a DDF df. In response to a request “sample(df, N)”, the distributed data framework 200 samples data of the DDF df based on a preconfigured sampling strategy and returns a set of N rows obtained by sampling to the caller.

The distributed data framework 200 supports an API “sample2ddf(df, p)” that allows the caller to compute a sample of p % of rows of the DDF df and assign the result to a new DDF. In response to a request “sample2ddf(df, p)”, the distributed data framework 200 samples data of the DDF df based on a preconfigured sampling strategy to identify p % of rows of the DDF df and creates a new DDF based on the result. The distributed data framework 200 returns the result DDF to the caller, for example, by sending a reference or pointer to the DDF to the caller.
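The three retrieval APIs named above can be sketched in a single process as follows. The function names fetchRows, sample, and sample2ddf come from the description; everything else is an assumption for illustration, including the DDF stand-in class, the seed parameter, and the choice of uniform sampling without replacement as the preconfigured sampling strategy.

```python
import itertools
import random
from typing import Iterator, List


class DDF:
    """Single-process stand-in for a DDF: rows live in per-node segments."""

    def __init__(self, segments: List[list]) -> None:
        self.segments = segments

    def rows(self) -> Iterator[object]:
        for segment in self.segments:
            yield from segment

    def num_rows(self) -> int:
        return sum(len(segment) for segment in self.segments)


def fetchRows(df: DDF, n: int) -> list:
    """Return the first n rows without materializing the whole DDF."""
    return list(itertools.islice(df.rows(), n))


def sample(df: DDF, n: int, seed: int = 0) -> list:
    """Return n rows chosen uniformly without replacement (assumed strategy)."""
    rows = list(df.rows())
    return random.Random(seed).sample(rows, min(n, len(rows)))


def sample2ddf(df: DDF, p: float, seed: int = 0) -> DDF:
    """Sample about p percent of each segment and return the result as a new DDF."""
    rng = random.Random(seed)
    sampled = []
    for segment in df.segments:
        k = round(len(segment) * p / 100)  # per-segment rounding, so p is approximate
        sampled.append(rng.sample(segment, k))
    return DDF(sampled)


df = DDF([[1, 2, 3, 4], [5, 6, 7, 8]])
head = fetchRows(df, 3)
some_rows = sample(df, 5)
half = sample2ddf(df, 50)
```

Note that fetchRows and sample hand small row lists back to the caller, while sample2ddf returns another DDF object, mirroring the description's point that the result DDF is returned by reference rather than copied to the client.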

Collaboration Using Distributed Data Frame Structures

In an embodiment, the distributed data-frame manager 310 acts as a server that allows users to connect and create sessions that allow the users to interact with the distributed data framework 200 and process data. Accordingly, the distributed data framework 200 creates sessions that allow users to maintain distributed in-memory data structures in the in-memory cluster computing engine 220 for long periods of time, for example, weeks. Furthermore, the session maintains the state of the in-memory data structures so as to allow a sequence of multiple interactions with the same data structure. The interactions include requests that modify the data structure such that a subsequent request can access the modified data structure. As a result, a user can perform a sequence of operations to modify the data structure.

The distributed data-frame manager 310 allows users to collaborate using a particular large distributed data structure (i.e., DDF). For example, a particular user can create a DDF by loading data from log files of an enterprise into an in-memory data structure, perform various transformations, and share the data structure with other users. The other users may continue making transformations or view the data from the DDF in a user interface, for example, build a chart based on the data of the DDF for presentation to an audience.

The analytics framework 230 receives a user request (or a request from a software module) to assign a name to a distributed data structure, for example, a DDF or a machine learning model. The analytics framework 230 may further receive requests to provide the name of the distributed data structure to other users or software modules. Accordingly, the analytics framework 230 allows multiple users to refer to the same distributed data structure residing in the in-memory cluster computing engine 220.

The analytics framework 230 supports an API that sets the name of a distributed data structure, for example, a DDF or any data set. For example, a user may invoke a function (or method) “setDDFName(df, string_name)” where “df” is a reference/pointer to the distributed data structure stored in the in-memory cluster computing engine 220 and “string_name” is an input string specified by the user for use as the name of the “df” structure. The analytics framework 230 processes the function setDDFName by assigning the name “string_name” to the structure “df”. For example, a user may execute queries to generate a data set representing flight information based on data obtained from airlines. A function/method call “setDDFName(df, “flightinfo”)” assigns the name “flightinfo” to the data set identified by df.

The analytics framework 230 further supports an API to get a uniform resource identifier (URI) for a data structure. For example, the analytics framework 230 may receive a request to execute “getURI(df)”. The analytics framework 230 generates a URI corresponding to the data structure or data set represented by df and returns the URI to the requestor. For example, the analytics framework 230 may generate a URI “ddf://servername/flightinfo” in response to the “getURI(df)” call. The URI may be provided to an application executing on a client device 130.
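The naming and URI calls can be sketched together. The method names setDDFName and getURI and the URI form “ddf://servername/flightinfo” follow the description; the AnalyticsFramework class, its name table, and the rule that a URI is built from the server name plus the assigned name are assumptions made for this example.

```python
from typing import Dict


class AnalyticsFramework:
    """Toy name table mapping user-specified names to in-memory structures."""

    def __init__(self, server_name: str) -> None:
        self.server_name = server_name
        self._name_to_ddf: Dict[str, object] = {}

    def setDDFName(self, df: object, string_name: str) -> None:
        # Drop any earlier name for this structure, then record the new one.
        for name, existing in list(self._name_to_ddf.items()):
            if existing is df:
                del self._name_to_ddf[name]
        self._name_to_ddf[string_name] = df

    def getURI(self, df: object) -> str:
        # Assumed URI scheme: ddf://<server name>/<assigned name>.
        for name, existing in self._name_to_ddf.items():
            if existing is df:
                return f"ddf://{self.server_name}/{name}"
        raise KeyError("no name has been assigned to this data structure")


framework = AnalyticsFramework("servername")
df = [{"flight": "XY1", "passengers": 120}]  # stand-in for a DDF reference
framework.setDDFName(df, "flightinfo")
uri = framework.getURI(df)
```

The lookup compares object identity rather than contents, so two structures holding equal data are still treated as different structures.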

The analytics framework 230 maintains a mapping from DDFs to URIs. If a DDF is removed from memory, the corresponding URI becomes invalid. For example, if a client application presents a document having a URI that has become invalid, the data analysis system 100 does not process queries based on that URI. The data analysis system 100 may return an error indicating that the query is directed to an invalid (or non-existent) DDF. If the data analysis system 100 loads the same set of data (as the DDF that was removed from memory) as a new DDF, the client devices request a new URI for the newly created DDF. This is so because the new DDF may have a different location within the parallel/distributed system and may be distributed differently from the previously loaded DDF even though the two DDFs store identical data. In an embodiment, the data analysis system 100 may store two or more copies of the same DDF. For example, two or more clients may request access to the same data set with the possibility of making modifications. In this situation, each DDF representing the same data is assigned a different URI. For example, a first DDF representing a first copy of the data is assigned a first URI and a second DDF representing a second copy of the same data is assigned a second URI distinct from the first URI. Accordingly, requests for processing received by the data analysis system 100 based on the first URI are processed using the first DDF, and requests for processing received by the data analysis system 100 based on the second URI are processed using the second DDF.
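The URI lifecycle described above (invalidation on removal, a fresh URI on reload, and distinct URIs for copies of the same data) can be sketched as a small registry. The class URIRegistry, the exception InvalidDDFError, and the counter-based URI form are hypothetical choices made for this illustration.

```python
import copy
import itertools
from typing import Dict


class InvalidDDFError(LookupError):
    """Raised when a query names a URI whose DDF is no longer in memory."""


class URIRegistry:
    """Toy mapping between URIs and in-memory data structures."""

    def __init__(self, server_name: str) -> None:
        self._server = server_name
        self._ddfs: Dict[str, object] = {}
        self._counter = itertools.count(1)

    def register(self, ddf: object) -> str:
        # Every registration gets a fresh URI, even for identical data.
        uri = f"ddf://{self._server}/ddf-{next(self._counter)}"
        self._ddfs[uri] = ddf
        return uri

    def remove(self, uri: str) -> None:
        # Removing the DDF from memory invalidates its URI.
        self._ddfs.pop(uri, None)

    def resolve(self, uri: str) -> object:
        try:
            return self._ddfs[uri]
        except KeyError:
            raise InvalidDDFError(f"invalid or non-existent DDF: {uri}") from None

    def copy_ddf(self, uri: str) -> str:
        """Create an independent copy of a DDF under its own URI."""
        return self.register(copy.deepcopy(self.resolve(uri)))


registry = URIRegistry("servername")
first_uri = registry.register([1, 2, 3])
second_uri = registry.copy_ddf(first_uri)

# A modification made through the second URI does not affect the first copy.
registry.resolve(second_uri).append(4)
```

Requests are routed purely by URI, so each client's modifications stay with the copy that its URI identifies.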

The URI can be communicated between applications or client devices. For example, the client device 130 may communicate the URI to another client device. Alternatively, an application that has the value of the URI can send the URI to another application. For example, the URI may be communicated via email, text message, or by physically copying and pasting from one user interface to another user interface. The recipient of the URI can use the URI to locate the DDF corresponding to the URI and process it. For example, the recipient of the URI can use the URI to inspect the data of the DDF or to transform the data of the DDF.

FIG. 4 illustrates collaboration between two client devices using a DDF, according to an embodiment of the invention. Assume that client device 130 a has access to an in-memory distributed data structure, for example, a DDF 420 stored in the in-memory cluster computing engine 220. The client device 130 a has access to the DDF 420 because an application on client device 130 a created the DDF 420. Alternatively, the client device 130 a may have received a reference to the DDF 420 from another client device or application. The distributed data framework 200 receives 425 a request to generate a URI (or a name) corresponding to the DDF 420. The distributed data framework 200 generates a URI corresponding to the DDF 420. The distributed data framework 200 sends 435 the generated URI to the client device 130 a that requested the URI.

The client device 130 a may send 450 the URI corresponding to the DDF 420 to another client device 130 b, for example, via a communication protocol such as email. In some embodiments, the URI may be shared between two applications running on the same client device. For example, the URI may be copied from a user interface of an application and pasted into the user interface of the other application by a user. The client device 130 b receives the URI. The client device 130 b can send 440 requests to the distributed data framework 200 using the URI to identify the DDF 420. For example, the client device 130 b can send a request to receive a portion of data of the DDF 420.

In an embodiment, the in-memory cluster computing engine 220 stores a distributed data structure that represents a machine learning model. The distributed data framework 200 receives a request to create a name or URI identifying the machine learning model from a client device 130. The distributed data framework 200 generates the URI or name and provides it to the client device 130. The client device receiving the URI can transmit the URI to other client devices. Any client device that receives the URI can interact with the distributed data framework 200 to interact with the machine learning model, for example, to use the machine learning model for predicting certain behavior of entities. These embodiments allow a user (or users) to implement the machine learning model and train the model. The model is stored in-memory by the big data analysis system 100 and is ready for use by other users. Access to the in-memory model is provided by generating the URI and transmitting the URI to other applications or client devices. Users accessing the other client devices or applications can start using the machine learning model stored in memory.

Converting Immutable Datasets to Mutable Distributed Data Frame Structures

In some embodiments, the in-memory cluster computing engine 220 supports only immutable data sets. In other words, a user (e.g., a software module that creates/loads the data set for processing) of the data set is not allowed to modify the dataset. For example, the in-memory cluster computing engine 220 may not provide any methods/functions or commands that allow a data set to be modified. Alternatively, the in-memory cluster computing engine 220 may return an error if a user attempts to modify data of a data set.

The in-memory cluster computing engine 220 may not support mutable datasets if the in-memory cluster computing engine 220 supports a functional paradigm, i.e., a functional programming model based on functions that take a data set as input and return a new data set upon each invocation. As a result, each operation requires invocation of a function that returns a new data set. Accordingly, the in-memory cluster computing engine 220 does not support modification of states of data sets (these datasets may be referred to as stateless datasets.)

The distributed data framework 200 allows users to convert a dataset to a mutable dataset. For example, the distributed data framework 200 supports a method/function “setMutable(ddf)” that converts an immutable dataset (or DDF) input to the method/function to a mutable DDF. Subsequently, the distributed data framework 200 allows users to make modifications to the mutable DDF. For example, the distributed data framework 200 may add new rows to, delete rows from, or modify rows of the mutable DDF based on requests.

The distributed data framework 200 implements a data structure, for example, a table that tracks all DDFs that are mutable. A mutable DDF can have a long life since a caller may continue to make a series of modifications to the DDF. A user of the DDF may even pass a reference to the DDF to another user, thereby allowing the other user to continue modifying the dataset. In contrast, immutable datasets have a relatively short life since the dataset cannot be modified and is used as a read-only value that is input to a function (or output as a result by a function).

The distributed data framework 200 maintains metadata that tracks each mutable DDF. In an embodiment, the distributed data framework 200 implements certain mutable operations by invoking functions supported by the in-memory cluster computing engine 220. Accordingly, the distributed data framework 200 updates the metadata pointing at the DDF with a new DDF returned by the in-memory cluster computing engine 220 as a result of invocation of the function. Subsequent requests to process the DDF are directed to the new DDF structure pointed at by the metadata identifying the DDF. As a result, even though the underlying infrastructure of the in-memory cluster computing engine 220 supports only immutable data structures, the user of the distributed data framework 200 is able to manipulate the data structures as if they are mutable.
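The metadata indirection described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: ImmutableDataset is a stand-in for an engine dataset whose operations return new datasets, and FrameworkSketch, load, set_mutable, add_row, and delete_rows are hypothetical names, not the framework's actual API.

```python
class ImmutableDataset:
    """Stand-in for an engine dataset: every operation returns a new dataset."""

    def __init__(self, rows):
        self._rows = tuple(rows)

    def rows(self):
        return self._rows

    def with_row(self, row):
        return ImmutableDataset(self._rows + (row,))

    def without(self, predicate):
        return ImmutableDataset(r for r in self._rows if not predicate(r))


class FrameworkSketch:
    """Hypothetical framework layer that makes immutable datasets look mutable."""

    def __init__(self):
        self._current = {}     # metadata: DDF identifier -> dataset backing it now
        self._mutable = set()  # table of DDF identifiers marked mutable

    def load(self, ddf_id, dataset):
        self._current[ddf_id] = dataset

    def set_mutable(self, ddf_id):
        self._mutable.add(ddf_id)

    def _apply(self, ddf_id, operation):
        if ddf_id not in self._mutable:
            raise PermissionError(f"{ddf_id} is immutable")
        # The engine returns a new dataset; only the metadata pointer changes.
        self._current[ddf_id] = operation(self._current[ddf_id])

    def add_row(self, ddf_id, row):
        self._apply(ddf_id, lambda ds: ds.with_row(row))

    def delete_rows(self, ddf_id, predicate):
        self._apply(ddf_id, lambda ds: ds.without(predicate))

    def rows(self, ddf_id):
        # Reads always follow the metadata to the latest dataset.
        return self._current[ddf_id].rows()
```

Callers hold only the DDF identifier, so each modification is visible on the next read even though no engine dataset is ever changed in place.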

FIG. 5 illustrates a process for converting immutable data sets received from the in-memory cluster computing engine to mutable data sets, according to an embodiment of the invention. The distributed data framework 200 receives a request to build a DDF structure. In an embodiment, the DDF structure may be built by loading data from a set of files of the distributed file system 210, for example, by processing a set of log files of an enterprise.

The distributed data framework 200 sends 520 a request to the in-memory cluster computing engine 220 to retrieve the data of the requested data set. In an embodiment, the in-memory cluster computing engine 220 supports only immutable data sets and does not allow or support modifications to datasets. The in-memory cluster computing engine 220 loads the requested dataset in memory. The dataset may be distributed across memory of a plurality of compute nodes 280 of the in-memory cluster computing engine 220 (the dataset is also referred to as a DDF.)

The distributed data framework 200 marks the data set as immutable (for example, by storing a flag indicating the dataset as immutable in metadata.) This step may be performed if immutable is the default type of datasets supported by the in-memory cluster computing engine 220. Accordingly, if the distributed data framework 200 receives a request to modify the data of the dataset, for example, by deleting existing data, adding new data, or modifying existing data, the distributed data framework 200 denies the request. In an embodiment, the dataset is represented as a DDF structure. In other embodiments, the distributed data framework 200 may mark all DDF structures as mutable when they are created.

The distributed data framework 200 receives 530 a request to convert the dataset to a mutable dataset. In an embodiment, the request to convert the dataset may be supported as a method/function call, for example, a “setMutable” method/function. A caller may invoke the “setMutable” method/function providing a DDF structure as input. The distributed data framework 200 updates 540 the metadata structure describing the DDF to indicate that the DDF is mutable.

Subsequently, the distributed data framework 200 receives 550 a request to modify the DDF, for example, by adding data, deleting data, or updating data. The distributed data framework 200 performs the requested modifications to the DDF. In an embodiment, the distributed data framework 200 invokes a function of the in-memory cluster computing engine 220 corresponding to the modification operation. The in-memory cluster computing engine 220 generates 560 a new dataset that has the value equivalent to the requested modified DDF. The distributed data framework 200 modifies the metadata describing the DDF to refer to the modified DDF instead of the original DDF. Accordingly, if a requester accesses the data of the DDF, the requester receives the data of the modified DDF. Similarly, if a requester attempts to modify the DDF again, the new modification is applied to the modified DDF as identified by the metadata.

Collaboration Using Documents Based on Big Data Reports

Embodiments allow multiple collaborators to interact with a document. A collaborator can be represented as a user account of the system. Each user account may be associated with a client device, for example, a mobile device such as a mobile phone, a laptop, a notebook, or any other computing device. A collaborator may be represented as a client device. Accordingly, the shared document is shared between a plurality of client devices.

The document includes information based on one or more DDFs stored in the in-memory cluster computing engine 220, for example, a chart based on the DDF in the document. The collaborators can interact with the document in a global editing mode that causes changes to be propagated to all collaborators. For example, if a collaborator makes changes to the document or to the DDF identified by the document, all collaborators see the modified document (or information based on the modified DDF.)

In an embodiment, the document collaboration is based on a push model in which changes made to the document by any user are pushed to all collaborators. For example, assume that the document includes a chart based on a DDF stored in the in-memory cluster computing engine 220. If the distributed data framework 200 receives requests to modify the DDF, the distributed data framework 200 performs the requested modifications to the DDF and propagates a new chart based on the modified DDF to the client devices of the various collaborators sharing the document.

The distributed data framework 200 allows a collaborator X (or a set of collaborators) to switch to a local editing mode in which the changes made by the collaborator X (or any collaborator from the set) to a specified portion of the shared document are local and are not shared with the remaining collaborators. The local editing mode is also referred to herein as limited sharing mode, limited editing mode, or editing mode. For example, if collaborator X modifies the DDF, the changes based on the modifications to the DDF are visible in the document only to collaborator X. The distributed data framework 200 does not propagate the modifications to the document (or to a portion of the document or the DDF associated with the document) to the remaining collaborators. Accordingly, the distributed data framework 200 continues propagating to the remaining collaborators the original document and any information based on the version of the DDF as it existed before collaborator X switched to local editing mode. In an embodiment, the remaining collaborators can modify the original document (and associated DDFs) and the distributed data framework 200 does not propagate the modifications to collaborator X. The collaborator X may share the modified document based on the local edits with a new set of collaborators. Accordingly, the new set of collaborators can continue modifying the version of the document created by collaborator X without affecting the document edited by the original set of users.

In an embodiment, the local edits to the shared document are shared between a set of collaborators. Accordingly, if any of the collaborators from the set of collaborators makes a modification to the shared document, the modifications are propagated to only the set of collaborators identified for the local editing. This allows a team of collaborators to make modifications to the shared document before making the modifications publicly available to a larger group of collaborators sharing the document.

In an embodiment, the distributed data framework 200 receives a request that identifies a particular portion of the shared document for local editing. Furthermore, the request received specifies a set of collaborators for sharing local edits to the identified portion. Accordingly, any modifications made by the collaborators of the set to the identified portion are propagated to all the collaborators of the set. However, any modifications made by any collaborator to the shared document outside the identified portion are propagated to all collaborators that share the document, independent of whether the collaborator belongs to the specified set or not.

FIG. 6 shows the process illustrating use of a global editing mode and a local editing mode during collaboration, according to an embodiment of the invention. The distributed data framework 200 receives 610 a request to create a document. In an embodiment, the document is associated with one or more DDFs and receives information associated with the DDF. For example, the document may include a chart based on information available in the DDF. The distributed data framework 200 updates the chart based on changes to the data of the DDF. For example, if the distributed data framework 200 receives an update to the data of the DDF (including requests to delete, add, or modify data), the distributed data framework 200 modifies the chart to display the modified information.

The collaboration module 370 receives 620 a request to share the document with a first plurality of collaborators. The collaboration module 370 receives 630 requests to interact with the shared document from the first plurality of collaborators. The requests may include requests to edit the document, requests to make modifications to the data of the DDF, and so on.

The collaboration module 370 further receives a request from a particular collaborator (say collaborator X) to perform local editing on a selected portion of the shared document (or the entire document). The request identifies a set of collaborators that share the local edits to the selected portion of the document. The collaboration module 370 may create 650 a copy of data related to the identified portion of the shared document for collaborator X to perform local editing. The copy of the portion of the shared document is called the locally accessible document and the original shared document (which can be edited by all collaborators) is called the globally accessible document.

In an embodiment, the collaboration module 370 shares the associated DDFs between the locally accessible document and the globally accessible document when the locally accessible document is created. However, if the distributed data framework 200 receives a request from any of the collaborators to modify an underlying DDF, the distributed data framework 200 makes a copy of the DDF and modifies the copy of the DDF. One of the documents subsequently is associated with the modified DDF and the other document is associated with the original DDF.

In an embodiment, the collaboration module 370 may obtain a subset of data of the DDF that provides data to the chart displayed on the locally edited document. For example, the chart may display data for a small time period out of a longer period of data stored in the DDF. Alternatively, the chart may display partially aggregated data. For example, the DDF may store data at an interval of seconds and the chart may display data aggregated at intervals of days. Accordingly, the distributed data framework 200 determines the aggregated data, which may be much smaller than the total data of the DDF and can be stored on the client device instead of the in-memory cluster computing engine.

In an embodiment, the distributed data framework 200 checks if the size of the aggregated data is below a threshold value. If the size of the aggregated data is below the threshold value, the distributed data framework 200 sends the data to the client device 130 for further processing. The client device can perform certain operations based on the locally stored data, for example, further aggregation based on the data. Processing the locally stored data allows the client device to efficiently process user requests. For example, if the user wants to view a smaller slice of data than that shown on the chart, the client device 130 can use the locally stored data to respond to the query. Accordingly, the chart displayed on the client device is updated without updating the charts displayed on the remaining client devices that share the original document.

Similarly, if the client device requests to further aggregate the data, for example, by requesting aggregates at the intervals of weeks or months, the request can be processed using the locally stored data. In an embodiment, the data set associated with the chart (for example, the partially aggregated data) is stored on another system distinct from the distributed data framework 200 and the client device. The other system allows large data sets to be loaded in memory that exceed the capacity of the client device 130.
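The size check and the client-side re-aggregation described above might look like the following Python sketch. The threshold value, the use of a row count as the size measure, and the function names are illustrative assumptions; daily totals are represented as a dictionary keyed by calendar date.

```python
from collections import defaultdict
from datetime import date

# Hypothetical threshold on the number of aggregated rows sent to a client.
SIZE_THRESHOLD = 1000


def should_send_to_client(aggregated, threshold=SIZE_THRESHOLD):
    """Return True when the aggregated data is small enough for the client."""
    return len(aggregated) < threshold


def aggregate_weekly(daily_totals):
    """Re-aggregate daily totals into (ISO year, ISO week) totals on the client."""
    weekly = defaultdict(float)
    for day, total in daily_totals.items():
        iso = day.isocalendar()
        weekly[(iso[0], iso[1])] += total
    return dict(weekly)


# Daily totals already received by the client device.
daily = {
    date(2024, 1, 1): 1.0,
    date(2024, 1, 2): 2.0,
    date(2024, 1, 8): 5.0,
}
if should_send_to_client(daily):
    weekly = aggregate_weekly(daily)
```

Because the weekly totals are computed entirely from the locally stored daily totals, no request to the in-memory cluster computing engine is needed for this kind of coarser view.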

The collaboration module 370 determines which copy of the document is associated with the modified DDF and which copy of the document is associated with the original DDF. If the request to modify the DDF is received from the locally accessible document, the distributed data framework 200 associates the locally accessible document with the modified DDF and the globally accessible document with the original DDF. Alternatively, if the request to modify the DDF is received from the globally accessible document, the distributed data framework 200 associates the globally accessible document with the modified DDF and the locally accessible document with the original DDF.
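The copy-on-modify association between the locally and globally accessible documents can be sketched in Python as below. The Document class and modify_ddf function are hypothetical names, and a plain list stands in for a DDF; a real system would copy a distributed structure rather than call copy.deepcopy.

```python
import copy


class Document:
    """Hypothetical document holding a reference to its associated DDF."""

    def __init__(self, name, ddf):
        self.name = name
        self.ddf = ddf


def modify_ddf(requesting_doc, other_doc, change):
    """Apply `change` for the requesting document without affecting the other.

    While both documents still share one DDF, the first modification makes a
    copy; the requesting document is associated with the modified copy and
    the other document stays associated with the original DDF.
    """
    if requesting_doc.ddf is other_doc.ddf:
        requesting_doc.ddf = copy.deepcopy(requesting_doc.ddf)
    change(requesting_doc.ddf)


shared = [1, 2, 3]
global_doc = Document("globally accessible", shared)
local_doc = Document("locally accessible", shared)

# A modification requested from the locally accessible document.
modify_ddf(local_doc, global_doc, lambda rows: rows.append(4))
```

The same function covers the opposite case: if the request arrives from the globally accessible document, that document receives the modified copy and the locally accessible document keeps the original.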

The collaboration module 370 receives 660 a request to share the locally accessible document with other collaborators (referred to here as a second plurality of collaborators). The second plurality of collaborators may overlap with the first plurality of collaborators. The collaboration module 370 provides access to the document to the second plurality of collaborators. The distributed data framework 200 receives requests to modify the locally accessible document from collaborators belonging to the second plurality of collaborators.

The ability to locally edit a portion of the document allows one or more collaborators to modify the document before making the modifications publicly available to all collaborators. For example, a portion of the shared document may be associated with a query that processes an in-memory distributed data structure. The portion of the document may show results of the query as a chart, in text form, or both. The system allows one or more collaborators to develop and test the query in a local edit mode to make sure the chart presented is accurate. Once the collaborators have fully developed and tested the query and the chart, the system receives a request from the collaborators to share the identified portion with all users that share the document (not just the developers and testers of the query and the chart.)

In an embodiment, the system determines a target set of collaborators that receive each modification made to the shared document. The target set of collaborators is determined based on whether the modification is made to the portion identified for local editing or another portion. Accordingly, if the system receives a request to modify a portion of the document that is distinct from the portion identified for local editing, the system propagates the changes to all collaborators sharing the document. This is so because by default all portions of the document are marked for global editing by all collaborators. However, if the system receives a request to modify the portion identified for local editing and the request is received from a collaborator from the set S of collaborators allowed to perform local editing on that portion, the system propagates the modification to all collaborators from the set S. In an embodiment, the collaborators not belonging to the set S of collaborators are allowed to modify the portion identified for the local editing. However, the system propagates these modifications only to collaborators that do not belong to the set S of collaborators allowed to perform local editing to the identified portion. In an embodiment, the system maintains a separate copy of the identified portion. Accordingly, the modifications made by users of the set S are made to one copy of the document (and propagated to the collaborators belonging to S) and the modifications made by users outside set S are made to another copy (and propagated to the collaborators outside S).
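The rule for choosing the target set of collaborators can be written as a short Python function. This is a sketch only; the function name, the portion identifiers, and the dictionary that records which portions are marked for local editing are assumptions made for illustration.

```python
def propagation_targets(all_collaborators, local_portions, portion, editor):
    """Return the collaborators who receive a modification.

    local_portions maps a portion identifier to the set S of collaborators
    allowed to perform local editing on that portion. Portions absent from
    the mapping are in the default global editing mode.
    """
    everyone = set(all_collaborators)
    local_set = local_portions.get(portion)
    if local_set is None:
        # Portion not marked for local editing: propagate to everyone.
        return everyone
    if editor in local_set:
        # Edit made by a member of S: propagate within S only.
        return set(local_set)
    # Edit made by a collaborator outside S: propagate outside S only.
    return everyone - set(local_set)


collaborators = {"ann", "bob", "cat", "dan"}
local = {"chart-1": {"ann", "bob"}}

targets_local = propagation_targets(collaborators, local, "chart-1", "ann")
targets_outside = propagation_targets(collaborators, local, "chart-1", "cat")
targets_global = propagation_targets(collaborators, local, "intro", "ann")
```

Here an edit by "ann" to "chart-1" reaches only "ann" and "bob", an edit by "cat" to the same portion reaches only "cat" and "dan", and an edit to the unmarked "intro" portion reaches all four collaborators.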

In an embodiment, the portion of the shared document identified for local editing includes a query Q1 processing a DDF associated with the shared document. Assume that a set S1 of collaborators are allowed to perform local edits to the document. The selected portion of the document may include a chart of the document associated with the query or the result of the query in text form. The local edits made by collaborators of set S1 may modify Q1 to become a query Q2. Accordingly, a chart based on query Q2 is propagated to collaborators belonging to set S1 and a chart based on the original query Q1 is propagated to the remaining collaborators (outside set S1) that share the document. If the data of the DDF is modified, the queries Q1 and Q2 are reevaluated to build a new corresponding chart (or textual representation of the result). The chart or results based on query Q2 are propagated to the collaborators of set S1 and the charts or results based on query Q1 are propagated to the remaining collaborators (outside the set S1.)

The system allows various portions of the same shared document to be locally edited by different collaborators or different sets of collaborators. For example, the system may receive a first request to allow local editing of a first portion of the shared document by a first set of collaborators. Subsequently the system may receive a second request to allow local editing of a second portion of the shared document by a second set of collaborators. The first and second set of collaborators may overlap or may be distinct.

Collaborative Code Editing Via Shared Documents

A shared document includes text portions, result portions, and code blocks. A text portion is received from a user and shared with other users. The shared document may be associated with one or more DDFs stored across a plurality of compute nodes. A code block may process data of a DDF. The code block may include queries that are executed. The result of execution of a query is displayed on the result portions of the document, for example, as charts. A code block is also referred to herein as a cell.

Embodiments allow references to DDFs to be included in documents. Users interacting with the big data analysis system 100 can share documents and interact with the same shared document via different client devices 130. If two or more documents share a DDF, changes made to the DDF via a document result in data displayed on the other documents being modified. For example, documents D1 and D2 may be distinct documents that have references to a DDF df. Document D1 may be shared by a set of users S1 and document D2 may be shared by a set of users S2 where S1 and S2 may be distinct sets of users with no overlap. However, if a user U1 from set S1 executes code via document D1 that modifies the DDF df, a user U2 from set S2 can view the modifications to the DDF df even though U2 is not sharing the document D1 with user U1. For example, the code modifications made by user U1 via document D1 may cause a chart or a result set displayed on document D2 to be updated as a result of modifications made to DDF df.

FIG. 7 shows collaborative editing of documents including text, code, and charts based on distributed data structures, according to an embodiment. As shown in FIG. 7, the distributed data framework 200 stores two shared documents 710 a and 710 b. The shared documents include references to the in-memory distributed data frame structure 420 (referred to as the DDF) stored in the in-memory cluster computing engine 220.

Each shared document 710 is associated with a set 730 of users 720 interacting with the shared document 710 via client devices 130. For example, users 720 p and 720 q interact with shared document 710 a via client devices 130 p and 130 q respectively. Similarly, users 720 r and 720 s interact with shared document 710 b via client devices 130 r and 130 s respectively. There may be more or fewer users in each set 730 of users sharing a document than those indicated in FIG. 7. The set 730 of users sharing a document may be distinct from the set of users sharing another document (i.e., have no overlap between the sets) or the sets of users sharing two distinct shared documents 710 may have an overlap (with one or more users having access to both the shared documents).

The shared document 710 may include text, code, and results based on code. The results based on code may comprise results presented as text or results presented as charts, for example, bar charts, scatter plots, pie charts and so on. In an embodiment, the results presented in a document are associated with the in-memory distributed data frame structure 420 (referred to as the DDF) stored in the in-memory cluster computing engine 220. For example, the document may specify a query based on the DDF such that the results/chart displayed in the document are based on the result of executing the query against the DDF. The code specified in the document may include a query for which the results are shown in the document. If a user updates the query of the shared document, each of the users that share the document receives updated results displayed in the shared document.

The code specified in the document may include statements that modify the DDF, for example, by deleting, adding, updating rows, columns, or any other portion of data of the DDF. If a user modifies the DDF, the results displayed in the document may get updated based on the modified DDF. For example, if certain rows of the DDF are deleted, any aggregate results displayed in the document or charts based on the DDF may get updated to reflect the deletion of the rows. Furthermore, if there are other documents that share the same DDF (for example, by including a URI to the DDF), the results/chart displayed in those documents may be updated to reflect the modifications to the shared DDF.

The shared documents may represent articles, presentations, reports and so on. The collaborative editing allows users to include charts and results of large distributed data structures in documents. For example, a team of developers may build an in-memory distributed data structure and share the URI of the in-memory distributed data structure with an executive for presentation to an audience. The ability to share the in-memory distributed data structure allows the data structure to be updated to reflect the latest information. This is distinct from a presentation with static information that doesn't change no matter when the presentation is given. In contrast, the sharing of documents with code and results based on executable code allows presentation of the latest results, which may get updated as the executive makes the presentation.

As shown in FIG. 7, user U1 may update text of the shared document 710 a (the text referring to static strings of text that are distinct from executable code). As a result of editing of the text by user U1, the updated text of the document is sent by the distributed data framework 200 to all users (e.g., users U1, U2) that share the document 710 a. However, the result of editing the text does not affect other documents that are distinct from the updated document, for example, document 710 b, and therefore the users of set 730 b are not affected by the updating of the text of document 710 a.

Embodiments further allow sharing of executable code and results based on sharing of code with other users. As shown in FIG. 7, user U1 may update a query included in shared document 710 a. The modification of the query may cause changes to results or charts presented in the document. As a result, the distributed data framework 200 executes the updated query using the DDF referenced in the document 710 a and sends the updated results or chart to the remaining users of the set 730 a (e.g., user U2.) If the query is simply reading the data of the DDF, the result of modification of the query is presented only to the users of the set 730 a that share the modified document (and is not shown to users of set 730 b that share the document 710 b).

Embodiments further allow a user to execute code that modifies the DDF referenced by the shared document 710 a. As a result, the data of the DDF may be changed (e.g., deleted, updated, or new data added.) The modification of the DDF may cause results of queries of the document to be updated if the queries use the DDF. Accordingly, the distributed data framework 200 identifies all queries of the document 710 a that use the DDF and updates the results of the queries displayed in the document. The updated document is sent for presentation to the users of the set 730 a.

Furthermore, the distributed data framework 200 identifies all other documents that include a reference to the DDF. The distributed data framework 200 identifies queries of all the identified documents and updates the results/charts of the queries displayed in the respective documents if necessary. The distributed data framework 200 sends the updated documents for presentation to all users that share the respective documents. For example, the DDF 420 may be updated based on execution of code of shared document 710 a. The distributed data framework 200 updates results of queries based on the DDF 420 in document 710 a as well as document 710 b. The updated document 710 a is sent to users of the set 730 a and the updated document 710 b is sent to the users of the set 730 b.
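The refresh of every document that references a modified DDF can be sketched in Python as follows. SharedDocument, refresh_documents, and the representation of a query as a Python callable paired with a DDF URI are assumptions made for illustration; the URIs shown are placeholders.

```python
class SharedDocument:
    """Hypothetical shared document with users and queries over DDF URIs."""

    def __init__(self, name, users, queries):
        self.name = name
        self.users = set(users)
        # Each query is a pair: (URI of the DDF it reads, function over rows).
        self.queries = list(queries)
        self.results = {}  # query index -> latest result


def refresh_documents(documents, ddf_uri, ddf_rows):
    """Re-run queries that use the modified DDF; return who must be notified."""
    notifications = {}
    for doc in documents:
        touched = False
        for index, (uri, query) in enumerate(doc.queries):
            if uri == ddf_uri:
                doc.results[index] = query(ddf_rows)
                touched = True
        if touched:
            # The updated document goes to every user sharing that document.
            notifications[doc.name] = set(doc.users)
    return notifications


doc_a = SharedDocument("710a", {"U1", "U2"}, [("ddf://servername/df", len)])
doc_b = SharedDocument("710b", {"U3", "U4"}, [("ddf://servername/df", sum),
                                              ("ddf://servername/other", len)])
doc_c = SharedDocument("other", {"U5"}, [("ddf://servername/other", len)])

notified = refresh_documents([doc_a, doc_b, doc_c], "ddf://servername/df", [1, 2, 3])
```

Only the two documents that reference the modified DDF are refreshed and sent to their users; the third document and the query over the unrelated DDF are left untouched.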

In an embodiment, a user may request a document or a portion of a document to be locally edited (and not shared). In this embodiment, the distributed data framework 200 makes a copy of the DDF 420 or an intermediate result set based on the DDF 420. In some embodiments, the distributed data framework 200 simply notes that the document is being locally edited and continues to share the DDF 420 with other documents until the DDF 420 is edited. If the DDF 420 is edited, the distributed data framework 200 makes a copy of the DDF for the document that is being locally edited. Note that the document being locally edited may be shared by a set of users even though it does not share the DDF referenced in the document with other documents.

In an embodiment, the portion of the document being locally edited is based on an intermediate result derived from the DDF 420. Accordingly, the distributed data framework 200 stores the intermediate result in either the in-memory cluster computing engine 220 (if the intermediate result is large) or else in a separate server (that may not be distributed). In an embodiment, the intermediate result is stored in the client device 130. Certain operations can be performed based on the data of the intermediate result alone, for example, aggregation of the intermediate results or changing the format of the chart (so long as the new format does not require additional data from the DDF). For example, a bar chart may be changed to a line chart based on the intermediate result. However, changing a bar chart to a scatter plot may require accessing the DDF to obtain new sample data (for example, if the user requests to display a scatter plot based on a subset of data of the bar chart.)

FIG. 8 shows an example of a shared document with code and results based on the code, according to an embodiment. As shown in FIG. 8, the document 800 includes executable code 810. The document 800 also includes results of execution of the executable code. The example executable code shown in FIG. 8 is “m<-adataalm(total_amount~trip_distance+payment_type,data=ddf)”. The executable code 810 is based on a DDF (identified as “ddf”) used in a machine learning model (identified as “lm”). This example code builds a linear model of total_amount as a function of trip_distance and payment_type. The result 820 of execution of the linear model using the DDF ddf is shown in text form in FIG. 8. In other embodiments, the result of execution of a command can be shown in chart form.

The user can modify the executable code, thereby causing updated results to be presented to all users sharing the document. FIG. 9 shows the document of FIG. 8 with modifications made to the code, thereby causing the results to be updated, according to an embodiment. The code shown in FIG. 8 is modified to change the model. The distributed data framework 200 executes the modified code 910 to obtain a new set of results 920. The updated results are presented to all users that share the document 800. If a modification is made to the data of the DDF ddf, the results of all documents that include executable code based on the DDF ddf are updated.

The analytics framework 230 generates reports, presentations, or dashboards based on the document comprising the text, code, and results. FIG. 10 shows an example of a dashboard generated from a document based on distributed data structures, according to an embodiment. The analytics framework 230 receives information identifying portions of the shared document that should not be displayed in the generated report or dashboard. For example, a user may indicate that portions of the document including executable code should not be displayed in the report, presentation, or dashboard. The analytics framework 230 generates the requested report, presentation, or dashboard by rendering the information identified for inclusion and excluding the information identified for exclusion.
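
As a rough illustration of the inclusion and exclusion step, a document can be modeled as an ordered list of typed portions that is filtered before rendering. The portion kinds, the dictionary layout, and the function render_report are hypothetical.

```python
def render_report(portions, excluded_kinds):
    """Return the content of the portions whose kind is not excluded, in document order."""
    return [p["content"] for p in portions if p["kind"] not in excluded_kinds]


document = [
    {"kind": "text", "content": "Taxi fares by distance"},
    {"kind": "code", "content": "m <- lm(...)"},  # placeholder code portion
    {"kind": "result", "content": "<chart>"},
]
dashboard = render_report(document, excluded_kinds={"code"})
```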

In an embodiment, the analytics framework 230 receives a request to convert the shared document into a periodic report. The analytics framework 230 receives a schedule for generating the periodic report. The analytics framework 230 executes the code blocks of the shared document in accordance with the schedule. Accordingly, the analytics framework 230 updates the result portions of the shared document based on the latest execution of the code block. For example, a code block may include a query based on a DDF. The analytics framework 230 updates the result portion corresponding to the code block based on the latest data of the DDF. The analytics framework 230 shares the updated document with users that have access to the shared document. These embodiments allow the analytics framework 230 to provide a periodic report to users. For example, the shared document may include a reference to a DDF based on an airlines database and the analytics framework 230 provides weekly or monthly reports to the users sharing the document. Similarly, the analytics framework 230 can convert the shared document into a slideshow or a dashboard based on a user request.

In an embodiment, the analytics framework 230 receives a request to generate a periodic report, slideshow, or dashboard and generates the requested document based on the shared document rather than converting the shared document itself. The analytics framework 230 maintains the periodic reports, slideshows, or dashboards as shared documents that can be further edited and shared with other users. Accordingly, various operations disclosed herein apply to these generated/transformed documents.

FIG. 10 shows various charts 1010 a, 1010 b, and 1010 c shown in the dashboard generated by the analytics framework 230. The charts 1010 refer to DDFs stored in the in-memory cluster computing engine 220. Accordingly, the distributed data framework 200 may automatically update the data of the charts 1010 based on updates to the data of the associated DDF.

In an embodiment, the analytics framework 230 identifies all charts in an input document. The analytics framework 230 determines a layout for all the charts in a grid, for example, a 3-column grid. The analytics framework 230 may receive (from the user) a selection of a template specifying the layout of the dashboard. The analytics framework 230 receives instructions from users specifying modifications to the layout. For example, the big data analysis system 100 allows users to drag and drop charts, which snap to the grid, and to resize charts within the grid. The big data analysis system 100 also allows users to set dashboards to be refreshed automatically at a specified time interval, e.g., 30 seconds, 1 minute, etc. The generated dashboard includes instructions to execute any queries associated with each chart at the specified time interval by sending the queries to the distributed data framework 200 for execution.
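
A default grid layout with a refresh interval might be computed as in the following sketch. The row-by-row fill order, the chart identifiers, the query strings, and the shape of the returned dictionary are assumptions chosen for illustration; the text specifies only that charts are laid out in a grid and that queries are re-sent at the chosen interval.

```python
def grid_layout(chart_ids, columns=3):
    """Assign each chart a (row, column) cell, filling the grid row by row."""
    return {chart: divmod(i, columns) for i, chart in enumerate(chart_ids)}


def dashboard_spec(chart_queries, refresh_seconds, columns=3):
    """Build a hypothetical dashboard description: cell positions plus the
    query each chart re-sends at the refresh interval."""
    layout = grid_layout(list(chart_queries), columns)
    return {
        "refresh_seconds": refresh_seconds,
        "charts": [
            {"id": cid, "row": layout[cid][0], "col": layout[cid][1], "query": q}
            for cid, q in chart_queries.items()
        ],
    }


spec = dashboard_spec(
    {"chart_1": "q1", "chart_2": "q2", "chart_3": "q3", "chart_4": "q4"},
    refresh_seconds=30,
)
```

With three columns, the fourth chart wraps to the first cell of the second row.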

Various portions of a document that is shared can be edited by all users that share the document. In an embodiment, if the system receives a user request for execution of a code block (or cell), the system shows an indication that the code block in the shared document is being executed. Accordingly, the system shows a change in the status of the code block. The status of the code block may be indicated based on a color of the text or background of the code block, font of the code block, or by any visual mechanism, for example, by showing the code block as flashing. In some embodiments, the status of the code block may be shown by a widget, for example, an image or icon associated with the code block. Accordingly, a status change of the code block causes a visual change in the icon or the widget. The changed status of the code block is synchronized across all client applications or client devices that share the document. Accordingly, the system shows the status of the code block as executing on any client device that is displaying a portion of the shared document including the code block that is executing.

In an embodiment, if the system receives a request to execute a code block of the shared document, the system locks the code block of the document, thereby preventing any users from editing the code block. The system also prevents other users from executing the code block. Accordingly, the system does not allow any edits to be performed on the code block that is executing from any client device that is displaying a portion of the shared document including the code block. Nor does the system allow the code block to be executed again from any client device until the current execution is complete. In other words, the system allows a single execution of a code block at a time. Users are allowed to modify other portions of the document, for example, text portions or other code blocks. Once the execution of the code block is complete, the system allows users to edit the code block or execute it again.
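
The locking behavior can be sketched as a small state machine guarded by a mutex. The class CodeBlock and its method names are hypothetical; the sketch only shows that a second execution and any edit are rejected while the block is executing and accepted again afterwards.

```python
import threading


class CodeBlock:
    """Hypothetical code block that allows one execution at a time."""

    def __init__(self, source):
        self.source = source
        self.status = "idle"            # "idle" or "executing"
        self._guard = threading.Lock()  # protects status transitions

    def try_start(self):
        with self._guard:
            if self.status == "executing":
                return False            # the block is already running
            self.status = "executing"
            return True

    def try_edit(self, new_source):
        with self._guard:
            if self.status == "executing":
                return False            # edits are rejected while the block runs
            self.source = new_source
            return True

    def finish(self):
        with self._guard:
            self.status = "idle"


block = CodeBlock("SELECT 1")
first_start = block.try_start()
second_start = block.try_start()
edit_during = block.try_edit("SELECT 2")
block.finish()
edit_after = block.try_edit("SELECT 2")
```

The status field is the value that would be synchronized to every client displaying the block.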

A user may close the client application (e.g., a browser or other user-agent software) used to view/edit the shared document on a client device 130. If the user closes the client application while one or more code blocks are executing, the system continues executing the code blocks and tracks the status of the code blocks. If a user that closed the client application reopens the client application to view the document, the system receives a login request from the user. In response to the request to view the shared document, the system provides the latest status of the code blocks. If a code block is executing, the system provides information indicating that the code block is executing, and the user is still not allowed to edit or execute the code block. If the code block has completed execution, the system updates the result portions of the document, sends the updated document to the client device of the user, and allows the user to edit or execute the code block.

Multi-Language Documents Processing Distributed Data Structures

Embodiments allow shared documents that interact with the big data analysis system 100 using multiple languages for processing data of DDFs. A shared document includes text portions, result portions, and code blocks. A text portion is received from a user and shared with other users. The shared document may be associated with one or more DDFs stored across a plurality of compute nodes. A code block may process data of a DDF.

A user can interact with the DDF by processing a query and receiving results of the query. The results of the query are displayed in the document and may be shared with other users. A user can also execute a statement via the document that modifies the DDF. The distributed data framework 200 receives statements sent by a user via a document and processes the statements (a statement can be a command or a query).

A code block may include instructions that modify the DDF. A code block may include queries that are executed by the data analysis system. The result of execution of the queries is presented in result portions of the shared document. A result portion may present results in text form or graphical form, for example, as charts. Modification of a query by a user in a code block may cause the corresponding result portion to be updated for all users sharing the document.

The big data analysis system 100 allows users to send instructions for processing data of a DDF using different languages. For example, the big data analysis system 100 receives a first set of instructions in a first language via a document and subsequently a second set of instructions in a second language provided via the same document (or via a different document). Both the first and second sets of instructions may process data of the same DDF. The ability to collaborate via multiple languages allows different users to use the language of their choice while collaborating. Furthermore, certain features may be supported by one language and not another. Accordingly, a user can use a first language for providing instructions and operations supported by that language and switch to a second language to use operations supported by the second language (and not supported by the first language). In an embodiment, the big data analysis system 100 allows users to specify code cells or code blocks in a document. Each code block may be associated with a specific language. This allows a user to specify the language for a set of instructions. In an embodiment, a shared document uses a primary language for processing the DDFs. However, code blocks of one or more secondary languages may be included.

FIG. 11 shows the architecture of the in-memory cluster computing engine illustrating how a distributed data structure (DDF) is allocated to various compute nodes, according to an embodiment. As shown in FIG. 11, the in-memory cluster computing engine 220 comprises multiple compute nodes 280. A DDF is distributed across multiple compute nodes 280. Each compute node 280 is allocated a portion of the data of the DDF, referred to as a DDF segment 1110. When the distributed data framework 200 receives a request to process data of a DDF, the distributed data framework 200 sends a corresponding request to each compute node 280 storing a DDF segment 1110 to process data of the DDF segment 1110. As shown in FIG. 11, each compute node 280 includes a primary runtime 1120 that stores the DDF segment 1110. The DDF segment 1110 has a data structure based on the primary runtime 1120. For example, the data structure of the DDF segment 1110 conforms to the primary language of the distributed data framework and can be processed by the primary runtime executing instructions of the primary language.
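
The allocation of DDF segments and the fan-out of a request to every segment can be sketched in a single process as follows. The text does not specify how rows are assigned to compute nodes, so the contiguous near-equal split used here is an assumption, as are the function names partition and run_on_segments.

```python
def partition(rows, num_nodes):
    """Split rows into num_nodes contiguous segments of near-equal size."""
    base, extra = divmod(len(rows), num_nodes)
    segments, start = [], 0
    for node in range(num_nodes):
        size = base + (1 if node < extra else 0)
        segments.append(rows[start:start + size])
        start += size
    return segments


def run_on_segments(segments, per_segment, combine):
    """Send the same request to every node's segment, then combine partial results."""
    return combine(per_segment(seg) for seg in segments)


trip_distances = [1.0, 2.5, 4.0, 0.5, 3.0, 6.0, 2.0]
segments = partition(trip_distances, num_nodes=3)
total = run_on_segments(segments, per_segment=sum, combine=sum)
```

Each list in segments stands in for one DDF segment 1110; in the described system the per-segment work would run on separate compute nodes rather than in a local loop.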

The primary runtime 1120 is capable of processing instructions in a primary language of operation for the distributed data framework 200. Accordingly, if a user provides a set of instructions using the primary language, the distributed data framework 200 provides corresponding instructions to the primary runtime for execution. For example, the primary runtime 1120 may be a virtual machine of a language, for example, a JAVA virtual machine for processing instructions received in the programming language JAVA. Alternatively, the primary runtime 1120 may support other programming languages such as PYTHON, the R language, or any proprietary language.

FIG. 12 shows the architecture of the in-memory cluster computing engine illustrating how multiple runtimes are used to process instructions provided in multiple languages, according to an embodiment. The distributed data framework 200 may receive instructions in a language different from the primary language. For example, if the primary language for interacting with the distributed data framework 200 is JAVA, a user may provide a statement in the R language.

In an embodiment, users can interact with the distributed data framework 200 using a set of language agnostic APIs supported by the distributed data framework 200. The language agnostic APIs allow users to provide the required parameters and identify a method/function to be invoked using the primary language. The distributed data framework 200 receives the parameters and the method/function identifier and provides these to the primary runtime 1120. The primary runtime 1120 invokes the appropriate method/function using the provided parameter values. The primary runtime 1120 obtains the results by executing the method/function. The distributed data framework 200 provides the results to the caller for display via the document used to send the request.
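
A language agnostic call of this kind can be pictured as a function identifier plus parameters carried in a neutral wire format. The sketch below uses JSON as that format and a small dictionary of functions standing in for the primary runtime; both choices, and the names PRIMARY_FUNCTIONS and handle_request, are assumptions rather than the disclosed API.

```python
import json

# Hypothetical functions exposed by the primary runtime.
PRIMARY_FUNCTIONS = {
    "mean": lambda column: sum(column) / len(column),
    "nrow": lambda column: len(column),
}


def handle_request(request_json, ddf_columns):
    """Decode a language-neutral request, invoke the named function in the
    primary runtime, and return a language-neutral response."""
    request = json.loads(request_json)
    function = PRIMARY_FUNCTIONS[request["function"]]
    column = ddf_columns[request["params"]["column"]]
    return json.dumps({"result": function(column)})


columns = {"trip_distance": [1.0, 2.0, 3.0]}
response = handle_request(
    json.dumps({"function": "mean", "params": {"column": "trip_distance"}}),
    columns,
)
```

Because the request carries only an identifier and parameter values, a client written in any language that can produce the wire format can issue it.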

The distributed data framework 200 may receive instructions in a language other than the primary language of the distributed data framework 200 (referred to as a secondary language). For example, the distributed data framework 200 may receive a request to process a function that is available in the secondary language but not in the primary language. For example, the R language supports several functions commonly used by data scientists that may not be supported by JAVA (or not available in the set of libraries accessible to the primary runtime 1120).

The in-memory cluster computing engine 220 starts a secondary runtime 1220 that is configured to execute instructions provided in the secondary language. The secondary runtime 1220 is started on each compute node 280 that has a DDF segment 1110 for the DDF being processed. Each compute node 280 transforms the data structure representing the DDF segment 1110 conforming to the primary language to a data structure representing the DDF segment 1210 conforming to the secondary language.

For example, if the primary runtime is a JAVA virtual machine and the secondary runtime is an R runtime, the compute node transforms a DDF segment represented as a list of byte buffers (representing a TablePartition structure conforming to the JAVA language representation) to a list of vectors in R (representing a DataFrame structure of the R language). Furthermore, the compute node performs appropriate data type conversions, e.g., the compute node converts a TablePartitionColumn iterator of Integer to an R integer vector, a Java Boolean to an R logical vector, and so on. Furthermore, the compute node encodes any special values based on the target runtime, for example, the compute node converts floating point NaN (the not-a-number special value) to R's NA value (the not-available value) while converting to an R representation, or to Java null pointers while converting to a Java representation. If the secondary runtime is based on Python, the compute node converts the DDF segment to a DataFrame representation of the Python language.
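
The conversion from packed column buffers to per-column vectors, including the special-value mapping, can be illustrated as follows. The little-endian buffer layout and the type codes are assumptions made for this sketch and are not the TablePartition format; Python's None stands in for the target runtime's missing-value marker (R's NA or a Java null).

```python
import math
import struct


def decode_column(buffer, type_code):
    """Decode one packed column buffer into a list of values.
    type_code is a struct format character: 'i' (int32), 'd' (float64), '?' (bool)."""
    count = len(buffer) // struct.calcsize("<" + type_code)
    return list(struct.unpack(f"<{count}{type_code}", buffer))


def to_secondary(buffers, type_codes, missing=None):
    """Convert packed column buffers into plain vectors, replacing floating
    point NaN with the target runtime's missing-value marker."""
    vectors = []
    for buffer, code in zip(buffers, type_codes):
        values = decode_column(buffer, code)
        if code == "d":
            values = [missing if math.isnan(v) else v for v in values]
        vectors.append(values)
    return vectors


segment = [
    struct.pack("<3i", 1, 2, 3),
    struct.pack("<3d", 0.5, float("nan"), 2.0),
    struct.pack("<3?", True, False, True),
]
vectors = to_secondary(segment, ["i", "d", "?"])
```

Integer and Boolean columns pass through with a type change only, while the floating point column has its NaN entry replaced.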

In an embodiment, the primary runtime 1120 (of each compute node having a DDF segment of the DDF being processed) executes instructions that transform the DDF segment 1110 representation (conforming to the primary language) to a DDF segment representation conforming to the secondary language. The primary runtime 1120 uses a protocol to communicate the transformed DDF segment representation to the secondary runtime 1220. For example, the primary runtime 1120 may open a pipe (or socket) to communicate with the process of the secondary runtime 1220. The transformed DDF segment representation is stored in the secondary runtime 1220 as DDF segment 1210. The secondary runtime 1220 performs the processing based on the DDF segment 1210 by executing the received instructions in the secondary language.
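
The hand-off over a pipe or socket can be sketched with a connected socket pair and a length-prefixed message. The text does not name a wire protocol, so the 4-byte length prefix and JSON payload are assumptions, and a thread stands in for the separate secondary runtime process.

```python
import json
import socket
import struct
import threading


def send_segment(sock, columns):
    """Write one length-prefixed JSON message containing the segment's columns."""
    payload = json.dumps(columns).encode("utf-8")
    sock.sendall(struct.pack("<I", len(payload)) + payload)


def recv_exact(sock, size):
    """Read exactly size bytes, since recv may return fewer than requested."""
    data = b""
    while len(data) < size:
        chunk = sock.recv(size - len(data))
        if not chunk:
            raise ConnectionError("peer closed the connection early")
        data += chunk
    return data


def recv_segment(sock):
    (size,) = struct.unpack("<I", recv_exact(sock, 4))
    return json.loads(recv_exact(sock, size).decode("utf-8"))


primary_end, secondary_end = socket.socketpair()
received = {}


def secondary_runtime():
    # Stands in for the secondary runtime process storing its DDF segment.
    received["segment"] = recv_segment(secondary_end)


worker = threading.Thread(target=secondary_runtime)
worker.start()
send_segment(primary_end, {"trip_distance": [1.0, 2.5], "payment_type": ["cash", "card"]})
worker.join()
primary_end.close()
secondary_end.close()
```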

The processing performed by the secondary runtime 1220 may result in generation of a new DDF (that is distributed as DDF segments across the compute nodes). Accordingly, each secondary runtime 1220 instance stores a DDF segment corresponding to the generated DDF. The generated DDF segment stored in the secondary runtime 1220 conforms to the secondary language. The secondary runtime 1220 transforms the generated DDF segment to a transformed generated DDF segment that conforms to the primary language. The secondary runtime 1220 sends the transformed generated DDF segment to the primary runtime 1120. The primary runtime 1120 stores the transformed generated DDF segment for processing instructions received via the document in the primary language.

Alternatively, the processing performed by the secondary runtime 1220 may result in modifications to the stored DDF segment 1210. The modified DDF segment conforms to the secondary language. The secondary runtime 1220 transforms the modified DDF segment to a transformed modified DDF segment that conforms to the primary language and sends the transformed modified DDF segment to the primary runtime 1120. The primary runtime 1120 stores the transformed modified DDF segment for processing instructions received via the document in the primary language. This mechanism allows the distributed data framework 200 to process instructions received for processing the DDF in languages other than the primary language of the distributed data framework 200. Accordingly, embodiments allow the DDF to be mutated using a secondary language. The distributed data framework 200 allows further processing to be performed using the primary language. Accordingly, a user can mix instructions for processing a DDF in different languages in the same document.

In an embodiment, the document for processing the DDF in multiple languages is shared, thereby allowing different users to provide instructions in different languages. In another embodiment, the same DDF is shared between different documents. The DDF may be processed using instructions in different languages received from different documents. Accordingly, the distributed data framework 200 may modify a DDF based on instructions in one language and then receive queries (or statements to further modify the DDF) in a different language. Embodiments can support multiple secondary languages by creating multiple secondary runtimes, one for processing instructions of each type of secondary language.

FIG. 13 shows an interaction diagram illustrating the processing of DDFs based on instructions received in multiple languages, according to an embodiment. The distributed data framework 200 receives instructions in different languages from documents edited by users via client devices 130. The instructions are received by the in-memory cluster computing engine 220 and sent to each compute node that stores a DDF segment of the DDF being processed by the instructions. In other embodiments, there may be different components/modules involved in the processing (different from those shown in FIG. 13). Also, there is a primary runtime 1120 and a secondary runtime 1220 on each compute node on which a DDF segment of the DDF being processed is stored.

The client device 130 sends 1310 instructions in the primary language to the primary runtime 1120 of each compute node storing a DDF segment 1110. The primary runtime 1120 receives the instructions in the primary language from the client device 130 and processes 1315 them using the DDF segment. The primary runtime 1120 sends 1320 the results back to the client device 130. Note that the results may be sent via different software modules, for example, the primary runtime 1120 may send the results to the in-memory cluster computing engine 220, the in-memory cluster computing engine 220 may send the results to the analytics framework 230, which in turn may send the results to the client device 130. For simplicity, the client device 130 is shown interacting with the primary runtime 1120. The processing 1315 of the instructions may cause the DDF to mutate such that subsequent instructions process the mutated DDF.

The client device 130 subsequently sends 1325 instructions in the secondary language. For example, the instructions may include a call to a built-in function that is implemented in the secondary language and not in the primary language. The primary runtime 1120 transforms 1330 the DDF segment stored in the compute node of the primary runtime 1120 to a transformed DDF segment that conforms to the secondary language. The primary runtime 1120 sends 1335 the transformed DDF segment to the secondary runtime 1220.

The secondary runtime 1220 processes 1340 the instructions in the secondary language using the transformed DDF segment. The processing 1340 may generate a result DDF. The result DDF may be a new DDF segment generated by processing 1340 the instructions. Alternatively, the result DDF segment may be a mutated form of the input DDF segment.

The secondary runtime 1220 transforms the result DDF to a format that conforms to the primary language. The secondary runtime 1220 sends 1350 the transformed result DDF to the primary runtime 1120. The primary runtime 1120 stores the transformed result for further processing, for example, if subsequent instructions based on the result DDF are received. The primary runtime 1120 sends 1335 any results based on the processing 1340 to the client device (for example, any result code, aggregate values, and so on).

As shown in FIG. 13, the client device 130 sends 1360 further instructions in the primary language for processing using the result DDF. The primary runtime 1120 processes 1365 the received instructions using the result DDF. The primary runtime 1120 sends 1370 any results based on the processing 1365 back to the client device 130.
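
The round trip of FIG. 13 on a single compute node can be simulated in one process as in the sketch below. The column-oriented layout for the primary runtime and the row-oriented layout for the secondary runtime are assumptions chosen only to make the two transformations visible; the classes PrimaryRuntime and SecondaryRuntime, the derived column "rate", and the sample values are hypothetical.

```python
class PrimaryRuntime:
    """Hypothetical primary runtime; stores a segment as a dict of columns."""

    def __init__(self, segment):
        self.segment = segment

    def to_secondary_format(self):
        # Columns -> list of row records (an assumed secondary-language layout).
        names = list(self.segment)
        return [dict(zip(names, row)) for row in zip(*self.segment.values())]

    def store(self, segment):
        self.segment = segment


class SecondaryRuntime:
    def process(self, rows, instruction):
        # Run the secondary-language instruction, then convert rows back to columns.
        # Assumes at least one row, which is enough for this sketch.
        result_rows = [instruction(row) for row in rows]
        names = list(result_rows[0])
        return {name: [row[name] for row in result_rows] for name in names}


primary = PrimaryRuntime({"trip_distance": [1.0, 2.0], "total_amount": [5.0, 9.0]})
secondary = SecondaryRuntime()

# Secondary-language step: transform, send, process, and send the result back.
add_rate = lambda row: {**row, "rate": row["total_amount"] / row["trip_distance"]}
primary.store(secondary.process(primary.to_secondary_format(), add_rate))

# Later primary-language step: the stored result DDF segment is used directly.
max_rate = max(primary.segment["rate"])
```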

The runtime of the distributed data framework 200 automatically selects the best representation of data for in-memory storage and algorithm execution without the user's involvement. By default, a compressed columnar data format is used, which is optimized for analytic queries and univariate statistical analysis. When a machine learning algorithm is invoked, the distributed data framework 200 performs a conversion that is optimized for that algorithm, e.g., for a linear regression command, the distributed data framework 200 extracts values from selected columns and builds a matrix representation. The distributed data framework 200 caches the matrix representation in memory for the iterative machine learning process. The distributed data framework 200 deletes the matrix representation from the cache (i.e., uncaches it) when the algorithm is finished.
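
The cache-then-uncache pattern for an iterative algorithm can be illustrated as follows. Plain Python lists stand in for the compressed columnar format, gradient descent on a one-variable line stands in for the linear regression command, and the names MatrixCache and fit_line, the step count, and the learning rate are assumptions made for this sketch.

```python
class MatrixCache:
    """Hypothetical cache for the row-major matrix built from selected columns."""

    def __init__(self):
        self._cache = {}

    def get(self, columns, names):
        key = tuple(names)
        if key not in self._cache:
            self._cache[key] = [list(row) for row in zip(*(columns[n] for n in names))]
        return self._cache[key]

    def uncache(self, names):
        self._cache.pop(tuple(names), None)

    def __len__(self):
        return len(self._cache)


def fit_line(columns, x_name, y_name, cache, steps=2000, rate=0.05):
    """Fit y = a*x + b by gradient descent, reusing the cached matrix each iteration."""
    names = [x_name, y_name]
    a = b = 0.0
    try:
        for _ in range(steps):
            matrix = cache.get(columns, names)   # built once, then served from cache
            n = len(matrix)
            grad_a = sum((a * x + b - y) * x for x, y in matrix) * 2 / n
            grad_b = sum((a * x + b - y) for x, y in matrix) * 2 / n
            a, b = a - rate * grad_a, b - rate * grad_b
    finally:
        cache.uncache(names)                     # drop the matrix when the algorithm finishes
    return a, b


cache = MatrixCache()
columns = {"trip_distance": [1.0, 2.0, 3.0, 4.0], "total_amount": [5.0, 7.0, 9.0, 11.0]}
slope, intercept = fit_line(columns, "trip_distance", "total_amount", cache)
```

The sample data lie exactly on the line total_amount = 2 * trip_distance + 3, so the fitted slope and intercept approach 2 and 3, and the cache is empty once the fit returns.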

The distributed data framework 200 provides an extensible framework for providing support for different programming languages. The distributed data framework 200 receives, from a user, software modules for performing conversions of data values conforming to the format of one language to the format of a new language. The distributed data framework 200 further receives code for a runtime of the new language. The distributed data framework 200 allows code blocks to be specified using the new language. As a result, the distributed data framework 200 can be easily extended with support for new languages without requiring modifications to the code for existing languages.
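
One way to picture this extensibility is a registry that maps a language name to its conversion modules and its runtime, so that adding a language is a new registration rather than a change to existing code. The class LanguageRegistry, the toy language "upper", and its single instruction "UP" are hypothetical.

```python
class LanguageRegistry:
    """Hypothetical registry mapping a language name to its converters and runtime."""

    def __init__(self):
        self._languages = {}

    def register(self, name, to_language, from_language, run):
        self._languages[name] = (to_language, from_language, run)

    def execute(self, name, code, segment):
        to_language, from_language, run = self._languages[name]
        return from_language(run(code, to_language(segment)))


registry = LanguageRegistry()

# Register a toy "upper" language without touching any other registration.
registry.register(
    "upper",
    to_language=lambda seg: list(seg),             # primary format -> new language format
    from_language=lambda values: tuple(values),    # new language format -> primary format
    run=lambda code, values: [v.upper() for v in values] if code == "UP" else values,
)
result = registry.execute("upper", "UP", ("cash", "card"))
```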

Alternative Embodiments

It is to be understood that the Figures and descriptions of the present invention have been simplified to illustrate elements that are relevant for a clear understanding of the present invention, while eliminating, for the purpose of clarity, many other elements found in a typical distributed system. Those of ordinary skill in the art may recognize that other elements and/or steps are desirable and/or required in implementing the embodiments. However, because such elements and steps are well known in the art, and because they do not facilitate a better understanding of the embodiments, a discussion of such elements and steps is not provided herein. The disclosure herein is directed to all such variations and modifications to such elements and methods known to those skilled in the art.

Some portions of the above description describe the embodiments in terms of algorithms and symbolic representations of operations on information. These algorithmic descriptions and representations are commonly used by those skilled in the data processing arts to convey the substance of their work effectively to others skilled in the art. These operations, while described functionally, computationally, or logically, are understood to be implemented by computer programs or equivalent electrical circuits, microcode, or the like. Furthermore, it has also proven convenient at times to refer to these arrangements of operations as modules, without loss of generality. The described operations and their associated modules may be embodied in software, firmware, hardware, or any combinations thereof.

As used herein, any reference to “one embodiment” or “an embodiment” means that a particular element, feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment. The appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment.

Some embodiments may be described using the expressions “coupled” and “connected” along with their derivatives. It should be understood that these terms are not intended as synonyms for each other. For example, some embodiments may be described using the term “connected” to indicate that two or more elements are in direct physical or electrical contact with each other. In another example, some embodiments may be described using the term “coupled” to indicate that two or more elements are in direct physical or electrical contact. The term “coupled,” however, may also mean that two or more elements are not in direct contact with each other, but yet still co-operate or interact with each other. The embodiments are not limited in this context.

As used herein, the terms “comprises,” “comprising,” “includes,” “including,” “has,” “having” or any other variation thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, article, or apparatus that comprises a list of elements is not necessarily limited to only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Further, unless expressly stated to the contrary, “or” refers to an inclusive or and not to an exclusive or. For example, a condition A or B is satisfied by any one of the following: A is true (or present) and B is false (or not present), A is false (or not present) and B is true (or present), and both A and B are true (or present).

In addition, “a” or “an” is used to describe elements and components of the embodiments herein. This is done merely for convenience and to give a general sense of the invention. This description should be read to include one or at least one, and the singular also includes the plural unless it is obvious that it is meant otherwise.

Upon reading this disclosure, those of skill in the art will appreciate still additional alternative structural and functional designs for a system and a process for displaying charts using a distortion region through the disclosed principles herein. Thus, while particular embodiments and applications have been illustrated and described, it is to be understood that the disclosed embodiments are not limited to the precise construction and components disclosed herein. Various modifications, changes and variations, which will be apparent to those skilled in the art, may be made in the arrangement, operation and details of the method and apparatus disclosed herein without departing from the spirit and scope defined in the appended claims.

1. A computer-implemented method, comprising: storing a document associated with a distributed data-frame representing an in-memory distributed data structure distributed across a plurality of processors; sharing the document among a plurality of user accounts; receiving a request from a first user account of the plurality of user accounts to designate at least a portion of the document for local editing; receiving, from the first user account, a modification of the document as a part of the local editing; receiving, from the first user account, a request to share the local editing to a new set of collaborators; and propagating the modification of the document to the new set of collaborators, the propagating excluding the plurality of user accounts outside of the new set of collaborators.
2. The computer-implemented method of claim 1, wherein the document comprises one or more charts visualizing data obtained by processing a query based on data stored in the distributed data-frame.
3. The computer-implemented method of claim 1, wherein sharing the document among the plurality of user accounts comprises: allowing the plurality of user accounts to edit the shared document, wherein an edit from one of the plurality of user accounts is propagated to the plurality of user accounts.
4. The computer-implemented method of claim 1, further comprising: receiving a second modification of the document outside the portion of the document designated for local editing; and propagating the second modification of the document to the plurality of user accounts including the user accounts outside of the new set of collaborators.
5. The computer-implemented method of claim 1, further comprising: receiving a request to update data stored in the distributed data frame; generating an updated distributed data frame based on the request; providing a first chart obtained by a first query of the updated distributed data frame including the local editing to the new set of collaborators; and providing a second chart obtained by a second query of the updated distributed data frame excluding the local editing to the plurality of user accounts outside of the new set of collaborators.
6. The computer-implemented method of claim 1, further comprising: receiving a request for globally sharing the local editing of the portion of the document with the plurality of user accounts; and propagating the modification of the document to the plurality of user accounts including the user accounts outside of the new set of collaborators.
7. The computer-implemented method of claim 1, further comprising: making a copy of the portion of the document designated for local editing; and responsive to receiving the modification of the document, performing the modification on the copy.
8. The computer-implemented method of claim 1, wherein the in-memory distributed data structure is a machine learning model, and wherein a request to update data stored in the distributed data-frame comprises a request to train the machine learning model.
9. The computer-implemented method of claim 1, further comprising: receiving a second modification of the document from a user account outside of the new set of collaborators; and propagating the second modification of the document to the plurality of user accounts including the new set of collaborators and other user accounts outside of the new set of collaborators.
10. The computer-implemented method of claim 1, wherein each user account is associated with a client device.
11. A non-transitory computer readable medium for storing computer code comprising instructions, when executed by one or more processors, causing the one or more processors to perform steps comprising: storing a document associated with a distributed data-frame representing an in-memory distributed data structure distributed; sharing the document among a plurality of user accounts; receiving a request from a first user account of the plurality of user accounts to designate at least a portion of the document for local editing; receiving, from the first user account, a modification of the document as a part of the local editing; receiving, from the first user account, a request to share the local editing to a new set of collaborators; and propagating the modification of the document to the new set of collaborators, the propagating excluding the plurality of user accounts outside of the new set of collaborators.
 12. The non-transitory computer readable medium of claim 11, wherein the document comprises one or more charts visualizing data obtained by processing a query based on data stored in the distributed data-frame.
 13. The non-transitory computer readable medium of claim 11, wherein sharing the document among the plurality of user accounts comprises: allowing the plurality of user accounts to edit the shared document, wherein an edit from one of the plurality of user accounts is propagated to the plurality of user accounts.
 14. The non-transitory computer readable medium of claim 11, wherein the steps further comprise: receiving a second modification of the document outside the portion of the document designated for local editing; and propagating the second modification of the document to the plurality of user accounts including the user accounts outside of the new set of collaborators.
 15. The non-transitory computer readable medium of claim 11, wherein the steps further comprise: receiving a request to update data stored in the distributed data frame; generating an updated distributed data frame based on the request; providing a first chart obtained by a first query of the updated distributed data frame including the local editing to the new set of collaborators; and providing a second chart obtained by a second query of the updated distributed data frame excluding the local editing to the plurality of user accounts outside of the new set of collaborators.
 16. The non-transitory computer readable medium of claim 11, wherein the steps further comprise: receiving a request for globally sharing the local editing of the portion of the document with the plurality of user accounts; and propagating the modification of the document to the plurality of user accounts including the user accounts outside of the new set of collaborators.
 17. The non-transitory computer readable medium of claim 11, wherein the steps further comprise: making a copy of the portion of the document designated for local editing; and responsive to receiving the modification of the document, performing the modification on the copy.
 18. The non-transitory computer readable medium of claim 11, wherein the in-memory distributed data structure is a machine learning model, and wherein a request to update data stored in the distributed data-frame comprises a request to train the machine learning model.
 19. The non-transitory computer readable medium of claim 11, wherein the steps further comprise: receiving a second modification of the document from a user account outside of the new set of collaborators; and propagating the second modification of the document to the plurality of user accounts including the new set of collaborators and other user accounts outside of the new set of collaborators.
 20. The non-transitory computer readable medium of claim 11, wherein each user account is associated with a client device.
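
By way of illustration only, the propagation behavior recited in the claims above (local editing, sharing with a new set of collaborators, and global sharing) may be sketched as follows. This is a minimal sketch under stated assumptions and not a definitive implementation: the class SharedDocument, its methods, and the per-account "view" dictionaries are hypothetical names introduced here, and the distributed data-frame is replaced by plain strings.

```python
class SharedDocument:
    """Hypothetical in-memory stand-in for a document shared among user accounts."""

    def __init__(self, accounts, portions):
        self.accounts = set(accounts)
        # Each account holds its own view: portion id -> text it currently sees.
        self.views = {a: dict(portions) for a in self.accounts}
        # Portion id -> accounts that receive local edits of that portion.
        self.local = {}

    def designate_local(self, owner, portion):
        """Mark a portion for local editing; only the owner sees edits at first."""
        self.local[portion] = {owner}

    def share_local(self, owner, portion, collaborators):
        """Share the local editing with a new set of collaborators."""
        self.local[portion] = {owner, *collaborators}
        for account in self.local[portion]:
            self.views[account][portion] = self.views[owner][portion]

    def modify(self, author, portion, text):
        """Apply an edit and return the sorted accounts it was propagated to."""
        audience = self.local.get(portion, self.accounts)
        if author not in audience:
            # Edits from outside the collaborator set go to every account.
            audience = self.accounts
        for account in audience:
            self.views[account][portion] = text
        return sorted(audience)

    def share_globally(self, owner, portion):
        """End local editing and propagate the owner's version to all accounts."""
        text = self.views[owner][portion]
        self.local.pop(portion, None)
        for account in self.accounts:
            self.views[account][portion] = text
```

In this sketch, an edit made to a locally edited portion after share_local reaches only the owner and the named collaborators, while an edit to any other portion, or a call to share_globally, reaches every user account.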